Figure 1: The Attentive Perceptron adaptively allocates computational effort according to how hard an example is
to classify. While the traditional Perceptron evaluates all the features for all the examples, a Budgeted Perceptron 
can only evaluate a constant number of features which is limited by the hard budget. From a budgeted learning point 
of view, the Attentive Perceptron adaptively allocates computation while maintaining an average budget. Therefore 
easily classifiable examples are filtered after having evaluated a few of their features, whereas hard to classify 
examples have the majority of their features evaluated. 



Abstract 

We propose a focus of attention mechanism that speeds up
the Perceptron algorithm by lowering the number of
features evaluated throughout training and prediction.
Whereas the traditional Perceptron evaluates all the
features of each example, the Attentive Perceptron
evaluates fewer features for easy to classify examples,
thereby achieving significant speedups with small losses
in prediction accuracy. Focus of attention allows the
Attentive Perceptron to stop the evaluation of features
at any interim point and filter the example. This creates
an attentive filter which concentrates computation on
examples that are hard to classify, and quickly filters
examples that are easy to classify.

1 Introduction 

Many online algorithms base their model update on the
margin of each example. Passive online algorithms, such
as Rosenblatt's Perceptron [?] and the online
passive-aggressive algorithms of Crammer et al. [3],
update the algorithm's model only if the value of the
margin falls below a defined threshold. These algorithms
fully evaluate the margin for each example, even if the
model is not to be updated!

The running time of these algorithms is linear either in
the number of features, or in the dimensionality of the
input space. Contemporary models may have thousands of
features, making running time daunting. The budgeted
learning community addresses this problem by putting a
budget on the number of features a classifier can
evaluate while learning and while making predictions.
Our work stems from the theoretical framework suggested
by Ben David and Dichterman [1], and is closely related
to recent work by Cesa-Bianchi et al. [2] as well as
Reyzin [6].

We differ in that we do not impose a hard budget
constraint on the number of features, but rather look at
the probability of making decision errors. Decision
errors are errors that occur when the algorithm stops the
feature evaluation process, predicts its outcome, and is
wrong. This work extends previous work by Pelossof et
al. [?].

We propose a new method for stopping the computation of
feature evaluations early for uninformative examples by
connecting the Perceptron algorithm to sequential
statistical tests [?, 4] (Figure 1). This connection
results in a general method that makes margin based
learning algorithms attentive, which means that they
have the ability to quickly filter uninformative
examples.



2 The Attentive Perceptron 

The margin of each example is computed as a weighted sum
of feature evaluations. Informative examples are
misclassified examples, which force the Perceptron to
perform a model update, whereas uninformative examples
are correctly classified and therefore ignored by the
Perceptron.
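As a point of reference, the following is a minimal sketch of the traditional Perceptron step described above, which evaluates every feature before deciding whether to update. The function names `margin` and `perceptron_step`, the learning rate `lr`, and the choice to treat a margin of exactly zero as a mistake are illustrative assumptions, not details taken from the text.

```python
from typing import List


def margin(w: List[float], x: List[float], y: int) -> float:
    """Full margin y * <w, x>: a weighted sum over all feature evaluations."""
    return y * sum(wj * xj for wj, xj in zip(w, x))


def perceptron_step(w: List[float], x: List[float], y: int,
                    lr: float = 1.0) -> List[float]:
    """Return the weights after seeing one example (x, y), with y in {-1, +1}.

    Informative (misclassified) examples trigger an update; uninformative
    (correctly classified) examples leave the model unchanged.
    Assumption: a margin of exactly 0 is counted as a mistake.
    """
    if margin(w, x, y) <= 0:
        return [wj + lr * y * xj for wj, xj in zip(w, x)]
    return list(w)
```

Note that `margin` always touches all features, even when the example turns out to be uninformative; this is the cost the attentive variant tries to avoid.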

We break up the feature evaluation for every example in
the stream. This breakup allows the Attentive Perceptron
to decide, after the evaluation of each feature, whether
the feature evaluation should continue or be stopped.
This decision making process allows us to stop the
evaluation of features early on examples with a large
partial margin after having evaluated only a few
features. Examples with a large partial margin are
unlikely to have a negative full margin. Therefore,
rejecting these examples early achieves large savings in
computation.
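The feature-by-feature decision process can be sketched as follows. This is only an illustration under stated assumptions: the function name `attentive_margin`, its return tuple, and the fixed feature order are hypothetical, and `tau` stands for a constant stopping threshold supplied by the caller.

```python
from typing import List, Tuple


def attentive_margin(w: List[float], x: List[float], y: int,
                     tau: float) -> Tuple[float, int, bool]:
    """Accumulate the margin one feature at a time and stop early.

    Returns (partial_or_full_margin, features_evaluated, stopped_early).
    Evaluation stops as soon as the partial margin exceeds tau, in which
    case the example is filtered without evaluating remaining features.
    """
    partial = 0.0
    n = len(w)
    for i, (wj, xj) in enumerate(zip(w, x), start=1):
        partial += y * wj * xj
        # Stopping after the last feature would save nothing, so only
        # report an early stop when features remain unevaluated.
        if partial > tau and i < n:
            return partial, i, True
    return partial, n, False
```

An example that is stopped early is treated as uninformative and skipped; one that runs to completion yields the full margin, which can then be compared to the update threshold as in the ordinary Perceptron.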

We define the mathematical setup to derive the stopping
conditions for margin evaluation. Let $X_1, \ldots, X_n$
be weakly dependent random variables. Let a partial sum
be defined by $S_i = X_1 + \ldots + X_i$ and the
remainder sum by $S_{in} = S_n - S_i$. The expectation of
a sum is denoted by $ES_i$ and its standard deviation by
$\mathrm{std}(S_i)$.

The Perceptron compares the margin (a sum) to a
threshold, and updates its model if the margin of the
example is negative. We formulate the equivalent
sequential decision making process, and derive constant
stopping thresholds $\tau$. These thresholds essentially
tell us when it is highly unlikely for the margin to end
below the desired importance threshold $\theta$.

The stopping thresholds are derived by requiring that the
joint probability of stopping (and predicting
$S_n > \theta$) while the actual full sum satisfies
$S_n < \theta$ is less than a required error rate
$\delta$:

$$P(S_n < \theta,\ \text{predict } S_n > \theta) = P(S_n < \theta,\ S_i > \tau) < \delta.$$

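We bound the probability of making a decision error:

\begin{align}
P(S_n < \theta,\ S_i > \tau) &\le P(S_n < \theta,\ S_i = \tau) \notag \\
&= P(S_n - ES_n < \theta - ES_n,\ S_i = \tau) \notag \\
&= P\big(S_n - ES_n > 2\tau - (\theta - ES_n)\big) \tag{1} \\
&= P\left(\frac{S_n - ES_n}{\mathrm{std}(S_n)} > \frac{2\tau - \theta + ES_n}{\mathrm{std}(S_n)}\right) \tag{2}
\end{align}

Equation 1 is derived by applying the reflection
principle, and equation 2 is its standardization.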



Since we assume that $X_1, \ldots, X_n$ are weakly
dependent, the sum $S_n = X_1 + \ldots + X_n$ is
approximately normally distributed by the Central Limit
Theorem. By standardizing $S_n$ we upper bound the
probability of making a decision error with the inverse
normal cumulative distribution function $\Phi^{-1}$.
Therefore, requiring that the probability of making a
decision error be less than $\delta$, we get the
following equality from equation 2:



$$\frac{2\tau - \theta + ES_n}{\mathrm{std}(S_n)} = \Phi^{-1}(1 - \delta). \tag{3}$$



The quantities ES n and std(S n ) can be approximated 
using the empirical data. 
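One simple way to obtain such approximations, shown here only as a sketch, is to take the sample mean and sample standard deviation of full margins recorded on examples that were evaluated completely. The text does not specify the estimator, so the helper name `estimate_margin_stats` and the use of plain sample statistics are assumptions.

```python
import statistics
from typing import Sequence, Tuple


def estimate_margin_stats(full_margins: Sequence[float]) -> Tuple[float, float]:
    """Estimate the mean and standard deviation of the full margin.

    `full_margins` holds full margins observed on fully evaluated examples.
    Assumption: plain sample statistics are an adequate approximation.
    """
    if len(full_margins) < 2:
        raise ValueError("need at least two observed margins")
    mean_sn = statistics.fmean(full_margins)
    std_sn = statistics.stdev(full_margins)  # sample standard deviation
    return mean_sn, std_sn
```

In an online setting these estimates would presumably be refreshed as the model changes, since the distribution of margins shifts with every update.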

Finally, by solving for the stopping threshold $\tau$ we
get from equation 3

$$\tau = \frac{1}{2}\left(\theta - ES_n + \mathrm{std}(S_n)\,\Phi^{-1}(1 - \delta)\right). \tag{4}$$



Therefore, examples whose partial margin $S_i$ hits this
boundary should be filtered, and with probability at
least $1 - \delta$ their full margin satisfies
$S_n > \theta$.
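The closed form for the stopping threshold is straightforward to evaluate numerically. The sketch below uses the inverse normal CDF from Python's standard library; the function name `stopping_threshold` and its argument names are illustrative, and the mean and standard deviation of the full margin are assumed to have been estimated beforehand.

```python
from statistics import NormalDist


def stopping_threshold(theta: float, mean_sn: float, std_sn: float,
                       delta: float) -> float:
    """Stopping threshold from the closed form
    0.5 * (theta - E[S_n] + std(S_n) * Phi^{-1}(1 - delta)).

    theta   : importance threshold on the full margin
    mean_sn : estimate of the expected full margin
    std_sn  : estimate of the standard deviation of the full margin
    delta   : required bound on the decision error rate, in (0, 1)
    """
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    z = NormalDist().inv_cdf(1.0 - delta)  # standard normal quantile
    return 0.5 * (theta - mean_sn + std_sn * z)


# Usage: a smaller delta gives a larger threshold, so fewer examples
# are filtered early and more features are evaluated on average.
loose = stopping_threshold(theta=0.0, mean_sn=0.0, std_sn=2.0, delta=0.05)
strict = stopping_threshold(theta=0.0, mean_sn=0.0, std_sn=2.0, delta=0.001)
```

The trade-off is visible directly in the formula: the error rate enters only through the quantile term, which is scaled by the spread of the margin.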

In summary, we presented a simple test to speed up the
Perceptron algorithm by quickly filtering unimportant
examples without fully evaluating their features. This
results in an algorithm which typically focuses on
examples near the decision boundary: the Attentive
Perceptron.